Second report of lyrics dataset

After preprocessing and analyzing our dataset in the first report, we will now build models that predict the genre and the interpreter of a song based on its lyrics.

First, we will enhance the preprocessing with additional transformations and add language prediction as a useful feature. Then we will try a few different models, find out which one performs best, and examine its performance in detail.

  1. Preprocessing: featurization
  2. Model creation
  3. Model results
  4. Conclusion

This time, preprocessing and modelling have been moved to separate notebooks, so you can see them if you are interested in the code.

1. Preprocessing: featurization

We need to prepare our data for modelling. We will create 2 models: genre prediction and interpreter prediction.

First, we will prepare the data as in the first report: we will drop songs with no lyrics and songs with a missing song name, a missing release year, or suspicious values of either. We will also try to merge interpreters that seem to just have the same name written differently, just like before. We will also create word count, unique word count and average word length features as in the first report. TF-IDF features will be added later, in the modelling stage.

Since we are only interested in a simple model and we have a very large dataset, we will create one universal dataset for both models.

The genres contain a 'Not Available' category that is neither consistent nor useful, so we will drop all rows with it.

The main problem for interpreter recognition will be that there are too many of them. The best solution is to keep only those with the most songs, so I decided to keep only interpreters with more than 200 songs (but I also created some models with the threshold at 100). We could also put interpreters with few songs into an "other" category, but there would be many of them and it probably would not be very useful. Since we have a large dataset, we can afford to make it smaller.
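The threshold filter can be sketched like this (toy song counts; the report's threshold is 200):

```python
import pandas as pd

MIN_SONGS = 200
# Toy data: one frequent interpreter and one rare, hypothetical one.
df = pd.DataFrame({"interpreter": ["Doc Watson"] * 250 + ["Someone Rare"] * 50})

counts = df["interpreter"].value_counts()
kept = counts[counts > MIN_SONGS].index   # interpreters with enough songs
df = df[df["interpreter"].isin(kept)]
```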

Last, we will use a language model to identify the language of each row. Since there is a lot of slang in individual songs, it can make many mistakes, and if we were to productionalize the model, it would be useful to limit the set of possible languages. But since we are just creating something simple, we can keep it as is and accept that a wrong language label can still be useful, since it says something about the style of the song.
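A sketch of adding the language column. The report does not say which detector was used, so `detect` below is a toy stand-in; a real implementation would call a language-identification library such as langdetect:

```python
import pandas as pd

# Toy stand-in detector: real code would call a language-identification model.
def detect(text: str) -> str:
    return "en" if " the " in f" {text.lower()} " else "unknown"

df = pd.DataFrame({"lyrics": ["I love the colorful clothes she wears"]})
df["language"] = df["lyrics"].apply(detect)
```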

You can see the preview of the resulting dataset below.

song_name year interpreter genre lyrics word_count unique_word_count average_word_length language
0 Without You 1968 Fleetwood Mac Rock I'm crazy for my baby But my baby she don't lo... 107 27 3.542056 en
1 The Width Of A Circle 1970 David Cook Rock In the corner of the morning in the past I wou... 320 187 3.900000 en
2 All The Madmen 1970 David Cook Rock Day after day They send my friends away To man... 323 125 3.835913 en
3 Awaiting On You All 1970 George Harrison Rock You don't need no love in You don't need no be... 284 105 3.845070 en
4 Good Vibrations 1970 Beach Boys Rock I, I love the colorful clothes she wears And t... 223 77 4.488789 en
... ... ... ... ... ... ... ... ... ...
46433 Believe Me Now Album Version 2016 Electric Light Orchestra Rock Can you hear me? Ahhhhhh, Something, something... 22 18 4.727273 en
46434 Dreamin Home Quadrophonic Mix 2016 Chicago Rock Dream, dream, dream Dreams in the night Take m... 22 21 4.227273 en
46435 Now More Than Ever Quadrophonic Mix 2016 Chicago Rock Now I need you More than ever No more cryin' W... 20 18 3.800000 en
46436 Built This Pool 2016 Blink 182 Rock UUH UUH UUH UUH UUH UUH UUH UUH I want to see ... 21 14 3.476190 en
46437 Brohemian Rhapsody 2016 Blink 182 Rock There's something about you I can't quite put ... 11 11 4.454545 en

46438 rows × 9 columns

The dataset has tens of thousands of rows, so there is still enough data for a model.

You can also see below how many interpreters we have in the dataset. There are a lot of them; however, after trying various song-count thresholds, it seems that model performance is not much worse with this number.

Number of unique interpreters: 143
interpreter
Doc Watson       1182
Ben Lee           714
Elton John        700
David Cook        625
Chris Brown       625
                 ... 
Daddy Yankee      204
Black Sabbath     203
Cam Ron           203
Bonnie Raitt      201
Blink 182         201
Name: count, Length: 143, dtype: int64

In the following image, you can see the distribution of the recognized languages. Since there are many more English songs than songs in any other language, the scale is logarithmic. The least represented languages might sometimes be mistakes; however, the most represented ones seem to correspond to our previous observations.

Genre representation has also changed with our filters, as you can see in the following plot. There are more rock songs than any other genre and the representation is imbalanced, so I have also used a logarithmic scale.

2. Model creation

Based on the information from the Data Science classes, I decided to try logistic regression as the base model and then also a Multinomial Naive Bayes classifier. Based on information from Machine Learning for Greenhorns, I also wanted to try an SVC and a simple MLP. However, I only trained them on a smaller dataset because training was slower and the results were not much better.

I have tried training with various thresholds for the minimal number of songs per interpreter. The models get worse when I keep more interpreters, but their results seem more interesting. I have also tried GridSearchCV, but it did not affect the results much, so in the end I decided to keep the defaults.
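The two main model families can be sketched as TF-IDF pipelines like the ones below (toy lyrics and genres; the real models also used the numeric and language features alongside TF-IDF):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the real lyrics dataset.
lyrics = ["guitar rock anthem loud", "rap beat flow rhyme",
          "guitar solo rock", "rhyme flow mic"]
genres = ["Rock", "Hip-Hop", "Rock", "Hip-Hop"]

logreg = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
naive_bayes = make_pipeline(TfidfVectorizer(), MultinomialNB())

logreg.fit(lyrics, genres)
naive_bayes.fit(lyrics, genres)
```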

I have used MLflow to track my models. You can see a summary of the latest, most representative results below. The first results are for the threshold at 200.

mlflow.runName accuracy top3_accuracy average_cross_val_score f1 precision recall
0 Interpreter Logistic Regression Model NaN NaN NaN NaN NaN NaN
1 Genre Logistic Regression Model 0.596123 0.897792 0.596274 0.506799 0.515848 0.596123
2 Interpreter Multinomial Naive Bayes Model 0.097900 0.207647 0.096152 0.063723 0.110193 0.097900
3 Genre Multinomial Naive Bayes Model 0.582014 0.898008 0.581518 0.470585 0.489658 0.582014

The accuracies of the individual models are not very high, but that is mostly due to the large number of possible targets. That is why we also have top3 and top5 accuracy metrics, which measure how often the correct class is among the 3 or 5 most probable ones. Note that top5 mostly makes sense for the interpreter model, which has well over a hundred categories, since there are just 8 genres. But considering we have 143 interpreters, identifying the correct one about 10% of the time, and having it among the top 5 much more often, is not a bad result. We can confidently say that genre and interpreter can be predicted; however, usable results would probably require more preprocessing and model tuning, or predicting over a smaller number of interpreters.
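The top-k metric can be computed from the predicted class probabilities; a minimal sketch with hand-made probabilities for illustration:

```python
import numpy as np

# A prediction counts as correct if the true class is among the
# k most probable classes.
def top_k_accuracy(y_true, proba, classes, k=3):
    top_k = np.argsort(proba, axis=1)[:, -k:]  # indices of the k largest probabilities
    hits = [t in classes[row] for row, t in zip(top_k, y_true)]
    return float(np.mean(hits))

classes = np.array(["Rock", "Pop", "Hip-Hop", "Jazz"])
proba = np.array([
    [0.5, 0.3, 0.1, 0.1],  # true Rock: most probable class, a top-3 hit
    [0.1, 0.2, 0.3, 0.4],  # true Rock: only 4th most probable, a top-3 miss
])
y_true = ["Rock", "Rock"]
top3 = top_k_accuracy(y_true, proba, classes, k=3)  # 0.5
```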

We can see that Multinomial Naive Bayes has the best top3/top5 accuracy; however, for interpreter, it has quite low precision. But if our goal is to have reasonable candidates for the interpreter that we can verify ourselves, it might be the best option. Surprisingly, logistic regression was the best model for interpreter identification, and for genre identification it was about as good as Naive Bayes. However, its accuracy does not improve as much as for the other models when we consider top3 or top5.

3. Model results

We will now see the details of the results of the Multinomial Naive Bayes classifier with the threshold at 200.

Below you can see some predictions the Logistic Regression makes for genre on the test set, along with the correct target. I have chosen genre for illustration, since we would probably need to study many more samples to understand the results of the interpreter model.

year word_count unique_word_count average_word_length alone always another around away baby ... language_sk language_sl language_so language_sq language_sv language_sw language_tl language_tr genre genre_predictions
0 2006 0.044190 0.072414 0.185318 0.000000 0.0 0.0 0.000000 0.0 0.000000 ... False False False False False False False False Rock Rock
1 2006 0.190977 0.255172 0.160432 0.000000 0.0 0.0 0.000000 0.0 0.161352 ... False False False False False False False False Hip-Hop Hip-Hop
2 2010 0.044190 0.049138 0.126603 0.000000 0.0 0.0 0.000000 0.0 0.000000 ... False False False False False False False False Rock Rock
3 2006 0.060569 0.101724 0.184924 0.824323 0.0 0.0 0.086162 0.0 0.162656 ... False False False False False False False False Rock Rock
4 2006 0.030284 0.061207 0.207603 0.000000 0.0 0.0 0.000000 0.0 0.000000 ... False False False False False False False False Rock Rock
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9280 2008 0.054388 0.087069 0.149341 0.000000 0.0 0.0 0.000000 0.0 0.000000 ... False False False False False False False False Rock Rock
9281 2013 0.058405 0.110345 0.253769 0.000000 0.0 0.0 0.000000 0.0 0.000000 ... False False False False False False False False Rock Rock
9282 2006 0.095488 0.187931 0.202123 0.000000 0.0 0.0 0.000000 0.0 0.092973 ... False False False False False False False False Rock Hip-Hop
9283 1995 0.043881 0.075862 0.169840 0.000000 0.0 0.0 0.000000 0.0 0.000000 ... False False False False False False False False Rock Rock
9284 2005 0.040482 0.058621 0.133546 0.000000 0.0 0.0 0.000000 0.0 0.000000 ... False False False False False False False False Rock Rock

9285 rows × 132 columns

We can see in our example that most songs are Rock and they are mostly classified correctly. However, there is one song misclassified as Hip-Hop.

Next, you can see the percentage of samples correctly classified for each genre.

This plot is rather bad news: our model only learned to recognize certain genres and mostly guesses Rock or Hip-Hop. However, that is no big surprise. For those results to be better, we would need a much larger representation of the other genres. We would definitely need to work on that in our further work if we wanted this model to be good.
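The per-genre accuracy behind such a plot can be computed directly from the prediction columns; a sketch with toy rows, using the column names from the table above:

```python
import pandas as pd

# Toy prediction table; genre / genre_predictions as in the test-set preview.
results = pd.DataFrame({
    "genre":             ["Rock", "Rock", "Hip-Hop", "Country"],
    "genre_predictions": ["Rock", "Rock", "Hip-Hop", "Rock"],
})

# Share of correctly predicted rows within each true genre.
correct = results["genre"] == results["genre_predictions"]
per_genre_accuracy = correct.groupby(results["genre"]).mean()
```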

4. Conclusion

After preprocessing and analyzing the data in the first report, we have added some further preprocessing and featurization. Then we trained several models including logistic regression, Naive Bayes and SVC.

We have found out that all our models can predict the genre well enough, and the predictions could be made even better with some further model tuning. However, predicting the interpreter when there are well over a hundred of them is a complex task, so our models were not able to achieve more than 10% accuracy. But considering that the correct interpreter was among the top 5 about 30% of the time, we can still consider such a model effective.

Acknowledgments: In the process of conducting the data analysis and building the models for this report, I utilized GitHub Copilot, an AI programming assistant powered by OpenAI's GPT-4 model. This tool provided assistance with code generation, debugging, and answering various programming-related queries, as well as with writing this acknowledgement. However, the rest of the text is fully my own work.